Authorship Attribution

In this project, we train a model on writings from different authors and try to predict the author of a given piece of writing. In this version, we predict the author using the bag-of-words technique and the Naive Bayes classification algorithm.

Input Data Preprocessing:

We received the writing assignments from the soft skills department. This data has many null values (missing assignments) and repeated values (student details and question-related information). We dropped all the unwanted data and the missing student assignments, and used the cleaned result to generate a CSV file that holds all the student information.
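The cleaning script itself is not shown here; the sketch below illustrates the kind of steps involved, using hypothetical file and column names (the real export may differ):

import pandas as pd

# 'raw_assignments.csv' and its column names are assumptions for illustration
raw = pd.read_csv('raw_assignments.csv')
raw = raw.dropna(subset=['author_writing'])  # drop missing assignments
raw = raw.drop_duplicates()                  # drop repeated student/question rows

# Write without a header, matching the header=None read in the next cell
raw[['author_label', 'ass_num', 'author_writing']].to_csv('scan1.csv', header=False, index=False)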

Reading the data

First, we load the data and print the last five data points.


In [39]:
import pandas as pd

# Load the cleaned data: author label, assignment number, and the writing itself
df = pd.read_csv('scan1.csv', sep=',', header=None, names=['author_label', 'ass_num', 'author_writing'])

# The assignment number is not needed for attribution, so drop it
df = df.drop('ass_num', axis=1)

# Print the last 5 rows
df.tail()


Out[39]:
author_label author_writing
1023 99 Exercise 1Make the following sentences more co...
1024 99 Exercise 1Make the following sentences more co...
1025 99 Q. Make the sentences more concise:1. We certa...
1026 99 Listening and Note taking: As I understood to ...
1027 99 Sir/Madam I am venkatWhen i was trolled your w...

Check the number of data points. Our dataset contains 1028 rows and two columns.


In [40]:
print(df.shape)


(1028, 2)

Splitting the data into Train and Test sets

We must train the model before testing it, which requires separate training and testing sets. So we split the data into train and test sets using the train_test_split function from sklearn.


In [41]:
from sklearn.model_selection import train_test_split

# Default split: 75% train, 25% test; random_state fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(df['author_writing'], df['author_label'], random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))


Number of rows in the total set: 1028
Number of rows in the training set: 771
Number of rows in the test set: 257
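One caveat worth noting (an aside, not part of the original run): with author labels running up to 99 in only 1028 rows, a plain random split can leave some authors barely represented in the training set. If every author has at least two assignments, passing stratify preserves each author's proportion across both sets:

# Stratified variant (assumes every author has at least two assignments)
X_train, X_test, y_train, y_test = train_test_split(
    df['author_writing'], df['author_label'],
    random_state=1, stratify=df['author_label'])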

Applying Bag of Words

To apply Naive Bayes to our dataset, we must first convert all the text into numeric values, since sklearn cannot work with non-numeric features. So we build a word-frequency matrix (bag of words) from our dataset and apply Bayes' theorem to that. We use the CountVectorizer class from sklearn for this.


In [42]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
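To make the transformation concrete, here is a toy illustration on made-up sentences (not part of the project data):

# Two toy documents to show what the word-frequency matrix looks like
toy_docs = ['write the report', 'the report is the report']
toy_vector = CountVectorizer()
toy_counts = toy_vector.fit_transform(toy_docs)

print(toy_vector.get_feature_names())  # ['is', 'report', 'the', 'write']  (get_feature_names_out() on newer sklearn)
print(toy_counts.toarray())            # [[0 1 1 1]
                                       #  [1 2 2 0]]

Each row is a document and each column counts how often one vocabulary word occurs in it; this matrix is what the classifier consumes.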

Now we apply the Bayes technique to this dataset, so we import the MultinomialNB class from sklearn. We use Multinomial Naive Bayes because it works well for classification with discrete features such as word counts.


In [43]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)


Out[43]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
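As a quick sanity check on what was learned (an illustrative aside, not part of the original analysis), the fitted model exposes a log prior per author and a smoothed log likelihood per (author, word) pair:

# class_log_prior_ holds log P(author); feature_log_prob_ holds
# log P(word | author), Laplace-smoothed by the alpha=1.0 shown above
print(naive_bayes.class_log_prior_.shape)   # (n_authors,)
print(naive_bayes.feature_log_prob_.shape)  # (n_authors, n_vocabulary_words)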

In [44]:
# Now that we have trained our model, it's time to test it on the held-out test data.

predictions = naive_bayes.predict(testing_data)

The performance of our model can be measured by computing its accuracy, precision, recall, and F1 score.


In [45]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions, average='weighted')))
print('Recall score: {}'.format(recall_score(y_test, predictions, average='weighted')))
print('F1 score: {}'.format(f1_score(y_test, predictions, average='weighted')))


Accuracy score: 0.0155642023346
Precision score: 0.0583657587549
Recall score: 0.0155642023346
F1 score: 0.0230127848805
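With scores this low, a per-author breakdown helps separate modeling problems from data problems. One way to get it (an additional diagnostic, not part of the original run) is scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Precision, recall, and F1 per author; authors with near-zero support
# point to the data-scarcity issue discussed below
print(classification_report(y_test, predictions))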

Conclusion:

Since we got very low scores, we cannot simply conclude that our model is wrong; other factors also play a role. Here are some factors we found that affected the scores:

  1. The data is not sufficient to train the model.
  2. The data contains a lot of noise; its quality is poor.
  3. The questions are not open-ended, so the students are forced to write on the topics they are given, which is not helpful for distinguishing one writer from another.
